feat: page-bounded Arrow decoder per data page (PR-6a.2) #6407
Draft
g-talbot wants to merge 4 commits into
Bridges PR-4's ColumnPageStream (raw compressed pages in storage order) to arrow's standard ParquetRecordBatchReaderBuilder (decoded arrays). PR-6's streaming merge engine drains each input row group through this to keep per-RG memory bounded — only one input RG's worth of bytes is materialised at a time, rather than the whole file.

Approach: reconstruct one row group's column-chunk byte layout in a buffer (column chunks placed at their original offsets, gaps zero-padded), wrap the buffer in `Bytes`, and feed it to `ParquetRecordBatchReaderBuilder::new_with_metadata` with `with_row_groups([rg_idx])`.

Byte-exact reconstruction comes from carrying each page's original Thrift-compact `header_bytes` through PR-4's streaming reader — no re-encoding, so encoder-version drift inside the compactor cannot silently corrupt outputs. Adds `header_bytes: Bytes` to `Page` and captures the drained header bytes inside `parse_page_header_streaming`.

A new `StreamDecoder` borrows the stream and exposes `next_rg()`, which returns one `RecordBatch` per input row group and is idempotent at EOF.

Tests (9, all passing): single-RG and multi-RG drains, multi-page columns, dict columns, null preservation, compression codec roundtrip (uncompressed/snappy/zstd — LZ4 is not enabled in our parquet feature set), idempotent EOF, byte-exact reconstruction proof, and I/O failures surfacing as `PageDecodeError::PageStream` rather than being masked as decode errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
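A hedged sketch of that reconstruction path using parquet-rs's public builder API; `pages` and `reconstructed_len` are illustrative inputs, not the PR's actual types:

```rust
use arrow::record_batch::RecordBatch;
use bytes::Bytes;
use parquet::arrow::arrow_reader::{ArrowReaderMetadata, ParquetRecordBatchReaderBuilder};
use parquet::errors::Result;

// Rebuild one row group's column-chunk bytes at their original file offsets
// (gaps zero-padded) and decode it with the standard arrow reader.
fn decode_rg(
    pages: &[(usize, Bytes)], // (original file offset, raw page bytes incl. header)
    reconstructed_len: usize, // one past the last byte offset of this row group
    metadata: ArrowReaderMetadata,
    rg_idx: usize,
) -> Result<Vec<RecordBatch>> {
    let mut buf = vec![0u8; reconstructed_len];
    for (offset, bytes) in pages {
        buf[*offset..*offset + bytes.len()].copy_from_slice(bytes);
    }
    // Bytes implements ChunkReader, so the builder can read directly from it.
    let reader = ParquetRecordBatchReaderBuilder::new_with_metadata(Bytes::from(buf), metadata)
        .with_row_groups(vec![rg_idx])
        .build()?;
    reader.map(|batch| batch.map_err(Into::into)).collect()
}
```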
CI nightly rustfmt (newer than my local at the time of the original push) wraps `write_parquet(...)` onto multiple lines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces PR-6a's per-RG fat-buffer approach. The previous implementation reconstructed a whole row group's column-chunk bytes into a single buffer and fed it to ParquetRecordBatchReaderBuilder — peak memory was RG-sized (tens to hundreds of MB per call). This rewrite is page-bounded.

API change: `StreamDecoder::next_rg() -> Option<RecordBatch>` is replaced by `decode_next_page() -> Option<DecodedPage>`. Each call returns one input data page's worth of decoded rows as an `ArrayRef`, plus `(rg_idx, col_idx, page_idx_in_col, row_start)` indexing so PR-6b's merge engine can slice take indices per page. Dictionary pages are absorbed silently (fed to the column reader for subsequent data-page decoding); INDEX_PAGE is skipped.

Memory at any time:
- one in-flight page (compressed + decompressed bytes)
- one cached dictionary page per (rg, col) when dict-encoded
- one column reader per (rg, col) with small bookkeeping (level decoders, value decoder)

It does NOT buffer the row group, a column chunk, or a materialised RecordBatch.

Implementation: wraps parquet-rs's public `GenericColumnReader` over a per-(rg, col) PageQueue that we feed one page at a time. Page → ColumnPage conversion handles decompression (via `compression::create_codec`, which required enabling parquet's `experimental` feature in our Cargo.toml — the API has been stable across recent parquet-rs versions, just not yet de-experimentalised), `format::Encoding` (Thrift wrapper) → `basic::Encoding` translation, and DataPageV2's uncompressed-levels-then-compressed-values layout.

Array builders cover the production schema: Boolean, Int8/16/32/64, UInt8/16/32/64, Float32/64, Utf8/LargeUtf8/Binary/LargeBinary, and `List<non-nullable primitive>` (DDSketch `keys` / `counts`). Dict columns decode to their value type (Utf8/Binary); the merge engine's union schema normalises strings to Utf8 anyway, and the output writer re-applies dict encoding based on observed cardinality.

Tests (9, all passing):
- single-RG and multi-RG round-trip (per-column comparison vs. the canonical arrow reader)
- per-page indexing (`row_start`, `page_idx_in_col` monotonic per (rg, col))
- idempotent EOF
- nullable column (`service` with nulls every 5th row)
- compression codecs (uncompressed, snappy, zstd)
- I/O failures surface as `PageDecodeError::PageStream`
- `List<UInt64>` (DDSketch `counts`) with variable list lengths, including an empty list and `u64::MAX`
- structural page-bounded contract: PageQueue depth ≤ 2 (one queued dictionary plus the current data page) across a long stream

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
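For orientation, a minimal sketch of what `DecodedPage` carries; the four index fields are named in the commit message above, while the `array` field name and exact types are assumptions:

```rust
use arrow::array::ArrayRef;

// Sketch only: field names beyond the four indices are illustrative.
pub struct DecodedPage {
    pub rg_idx: usize,          // input row group index
    pub col_idx: usize,         // leaf column index within the row group
    pub page_idx_in_col: usize, // data page ordinal within the column chunk
    pub row_start: usize,       // row offset of this page within its (rg, col)
    pub array: ArrayRef,        // decoded values for exactly this data page
}
```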
CI's `cargo +nightly fmt --check` flags a single trailing blank line at end of file. No functional change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- `StreamDecoder::decode_next_page()` returns one `DecodedPage` per call (rg_idx, col_idx, page_idx, row_start, ArrayRef) instead of materialising an entire row group at a time.
- PR-6b's merge engine consumes `DecodedPage`s in storage order (row-group-major, column-major-within-rg, page-major-within-col), applies merge plan slicing per page, and streams output pages directly into the writer without column-chunk staging.
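A hedged sketch of the consumption loop, assuming an async `decode_next_page` returning `Result<Option<DecodedPage>, PageDecodeError>` (consistent with the idempotent-EOF tests); `MergeEngine` and `apply_page_slice` are illustrative stand-ins for PR-6b:

```rust
// Pages arrive in storage order: row-group-major, column-major within the
// row group, page-major within the column. One page in flight at a time.
async fn drain(
    decoder: &mut StreamDecoder,
    merge: &mut MergeEngine, // hypothetical PR-6b consumer
) -> Result<(), PageDecodeError> {
    while let Some(page) = decoder.decode_next_page().await? {
        // Slice the merge plan's take indices to this page's row window
        // [row_start, row_start + array.len()), then stream output directly.
        merge.apply_page_slice(&page)?;
    }
    Ok(()) // subsequent calls keep returning Ok(None)
}
```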
How it works
- Pull the next `Page` from the underlying `ColumnPageStream`. Skip `INDEX_PAGE` (historical Thrift variant, not emitted by production writers).
- Maintain a per-(rg, col) `PageQueue` that feeds parquet-rs's `ColumnReader` one page at a time, plus a counter tracking rows decoded so far.
- Convert our `Page` to parquet-rs's `column::page::Page` enum: decompress via `parquet::compression::create_codec` (requires the `experimental` feature), translate `format::Encoding` (Thrift wrapper) → `basic::Encoding` (Rust enum) via a manual i32 match (no public conversion in parquet-rs; sketched after this list), and drop optional statistics.
- Drive the `ColumnReader` to decode exactly `header.num_values` records via `read_records(...)` calls in a loop, pulling values + def/rep levels into typed buffers.
- Build the `ArrayRef` from `(values, def_levels, rep_levels)` per the column's parquet physical type. Emit `DecodedPage`.
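The encoding translation is small; a hedged reconstruction follows. The numeric values are fixed by the Parquet Thrift spec, but `PageDecodeError::UnsupportedEncoding` is an illustrative error variant, not necessarily the PR's:

```rust
use parquet::basic::Encoding;
use parquet::format;

// format::Encoding is a thrift-generated newtype over i32; basic::Encoding is
// the idiomatic Rust enum the column reader expects.
fn translate_encoding(e: format::Encoding) -> Result<Encoding, PageDecodeError> {
    Ok(match e.0 {
        0 => Encoding::PLAIN,
        2 => Encoding::PLAIN_DICTIONARY,
        3 => Encoding::RLE,
        4 => Encoding::BIT_PACKED,
        5 => Encoding::DELTA_BINARY_PACKED,
        6 => Encoding::DELTA_LENGTH_BYTE_ARRAY,
        7 => Encoding::DELTA_BYTE_ARRAY,
        8 => Encoding::RLE_DICTIONARY,
        9 => Encoding::BYTE_STREAM_SPLIT,
        other => return Err(PageDecodeError::UnsupportedEncoding(other)),
    })
}
```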
Type coverage
- Flat physical types: `Boolean`, `Int8/16/32` + `UInt8/16/32` (parquet `Int32` with logical annotation), `Int64`/`UInt64` (parquet `Int64`), `Float32`, `Float64`, `Utf8`/`LargeUtf8`/`Binary`/`LargeBinary` (parquet `ByteArray`). Dictionary-encoded pages are decoded via the cached dict page → values pipeline.
- `List<T>`/`LargeList<T>` where outer + inner are non-nullable and inner is a flat primitive — covers DDSketch `keys` (`List<Int16>`) and `counts` (`List<UInt64>`). Dremel def/rep levels (max_def = 1, max_rep = 1) are decoded via the same `read_records` path; arrow offsets are computed via `list_offsets_from_levels` (sketched below).
- Other nested shapes (nullable list inner/outer, `Struct`, `Map`, `FixedSizeList`, multi-leaf nested) return an unsupported-type error rather than silently falling back to a different mechanism.
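For the supported list shape (non-nullable outer and inner, max_def = 1, max_rep = 1), offset computation reduces to a single pass over the levels. A hedged reconstruction of what `list_offsets_from_levels` likely does; the real helper's signature may differ:

```rust
// For non-nullable List<primitive> with max_def = 1 and max_rep = 1:
//   rep_level == 0 marks the start of a new list entry;
//   def_level == 1 means a real value, def_level == 0 an empty list.
fn list_offsets_from_levels(def_levels: &[i16], rep_levels: &[i16]) -> Vec<i32> {
    let mut offsets = Vec::with_capacity(def_levels.len() + 1);
    let mut values_so_far: i32 = 0;
    for (&def, &rep) in def_levels.iter().zip(rep_levels) {
        if rep == 0 {
            offsets.push(values_so_far); // new list starts here
        }
        if def == 1 {
            values_so_far += 1; // this level carries an actual value
        }
    }
    offsets.push(values_so_far); // closing offset
    offsets
}
```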
Sync ⇄ async bridging
The page stream is async (S3 reads); `PageReader` (from parquet-rs) is sync. Bridged via an `Arc<Mutex<VecDeque<ColumnPage>>>` per (rg, col): `decode_next_page` pulls from the stream (async) and pushes onto the queue; the sync `PageReader` impl then pops from the queue when the `ColumnReader` asks for the next page. `peek_next_page` and `skip_next_page` are implemented to support the parquet-rs reader's state machine.
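A hedged sketch of the sync side of that bridge; the `PageMetadata` field names match recent parquet-rs versions and may differ in older ones, and the real PR's queue element type is its own `ColumnPage`:

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

use parquet::column::page::{Page, PageMetadata, PageReader};
use parquet::errors::Result;

// Shared queue: the async side pushes pages it pulls from the stream; the
// sync PageReader impl pops them when the ColumnReader asks for the next page.
#[derive(Clone)]
struct PageQueue(Arc<Mutex<VecDeque<Page>>>);

impl Iterator for PageQueue {
    type Item = Result<Page>;
    fn next(&mut self) -> Option<Self::Item> {
        self.0.lock().unwrap().pop_front().map(Ok)
    }
}

impl PageReader for PageQueue {
    fn get_next_page(&mut self) -> Result<Option<Page>> {
        Ok(self.0.lock().unwrap().pop_front())
    }
    fn peek_next_page(&mut self) -> Result<Option<PageMetadata>> {
        Ok(self.0.lock().unwrap().front().map(|p| PageMetadata {
            num_rows: None,
            num_levels: Some(p.num_values() as usize),
            is_dict: matches!(p, Page::DictionaryPage { .. }),
        }))
    }
    fn skip_next_page(&mut self) -> Result<()> {
        self.0.lock().unwrap().pop_front();
        Ok(())
    }
}
```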
Schema handling
`parquet_to_arrow_schema(parquet_schema, None)` bypasses the ARROW:schema hint that would otherwise force Dictionary types — input parquet files written from arrow declare Dictionary columns in ARROW:schema metadata, but their page-encoded values are plain values that decode to `StringArray` etc. Decoding without the hint gives consistent flat-primitive output that the merge engine then interleaves.
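Concretely, a minimal sketch of that call, assuming a parquet-rs `ParquetMetaData` is in hand:

```rust
use arrow::datatypes::Schema;
use parquet::arrow::parquet_to_arrow_schema;
use parquet::file::metadata::ParquetMetaData;

fn decode_schema(meta: &ParquetMetaData) -> parquet::errors::Result<Schema> {
    // Passing None for key-value metadata skips the embedded ARROW:schema
    // hint, so Dictionary-typed columns come out as their flat value types.
    parquet_to_arrow_schema(meta.file_metadata().schema_descr(), None)
}
```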
Tests
9 tests, all passing:
- `test_drain_single_rg_round_trip`, `test_drain_multi_rg_round_trip` — full round-trip via `decode_next_page` matches `ParquetRecordBatchReaderBuilder`.
- `test_decoded_page_row_indexing` — `row_start` correctly tracks per-(rg, col) row offsets.
- `test_eof_idempotent` — repeated calls after EOF stay `Ok(None)`.
- `test_nullable_column_round_trip` — def-level decoding for nullable cols.
- `test_compression_codecs` — snappy, gzip, zstd round-trip.
- `test_page_bounded_queue_depth` — verifies the internal page queue depth stays ≤ 2 across a long stream (the page-bounded contract).
- `test_list_uint64_round_trip` — `List<UInt64>` (DDSketch shape) round-trip.
- `test_io_failure_surfaces_as_page_stream_error` — body GET failures propagate as `PageStream(Io)`, not masked as decode errors.
Stack
- Base: `gtt/column-page-stream-trait` (PR-5a #6406).
- PR-6b (#6409) builds the streaming merge engine on top of this decoder.